This is an interactive notebook that you can run locally.
Leaderboard Quickstart
In this notebook we will learn to use Weave’s Leaderboard to compare model performance across different datasets and scoring functions. Specifically, we will:
- Generate a dataset of fake zip code data.
- Author some scoring functions and evaluate a baseline model.
- Use these techniques to evaluate a matrix of models vs evaluations.
- Review the leaderboard in the Weave UI.
Step 1: Generate a dataset of fake zip code data
First we will create a function, generate_dataset_rows, that generates a list of fake zip code data.
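The code cell for this step is not shown here, so the following is only a minimal sketch of what such a generator could look like. It assumes each row carries the fields that the scoring functions in Step 2 refer to (city, state, population, median income, and "known for") plus a zip code. The seed list, the value ranges, and the count and seed parameters are hypothetical choices made for illustration. The real notebook may build its rows differently, for example by asking an LLM.

```python
import random

# Hypothetical seed records: illustrative placeholders, not real statistics.
_SEED_PLACES = [
    {"city": "Springfield", "state": "IL", "known_for": "being a state capital"},
    {"city": "Portland", "state": "OR", "known_for": "coffee and bridges"},
    {"city": "Austin", "state": "TX", "known_for": "live music"},
]


def generate_dataset_rows(count: int = 5, seed: int = 0) -> list[dict]:
    """Return `count` fake zip code rows (assumed schema, see lead-in)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    rows = []
    for _ in range(count):
        place = rng.choice(_SEED_PLACES)
        rows.append(
            {
                # Five-digit, zero-padded fake zip code.
                "zip_code": f"{rng.randint(501, 99950):05d}",
                "city": place["city"],
                "state": place["state"],
                "population": rng.randint(1_000, 100_000),
                "median_income": rng.randint(30_000, 120_000),
                "known_for": place["known_for"],
            }
        )
    return rows


rows = generate_dataset_rows(count=3)
```

Seeding the generator keeps the dataset stable between runs, which makes later comparisons between models easier to reproduce.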
Step 2: Author scoring functions
Next we will author 3 scoring functions:
- check_concrete_fields: Checks if the model output matches the expected city and state.
- check_value_fields: Checks if the model output is within 10% of the expected population and median income.
- check_subjective_fields: Uses an LLM to check if the model output matches the expected “known for” field.